Just for fun, let's extract topics from the Net Neutrality comments downloaded from the FCC, using the Algorithmia API to run an LDA (Latent Dirichlet Allocation) analysis.


In [24]:
from urllib.request import Request, urlopen
import json #for submission
import sqlite3 as sql #sql database connection
import pandas as pd #dataframes
import re #regular expressions
from numpy import random #for selection of sample

#vars
comment_db = '/run/media/potterzot/zfire1/data/fcc/nn_comments.db'
api_base = 'http://api.algorithmia.com/api/'
api_key = 'Your key here'
num_topics = 3 #number of topics we want from the LDA
sample_size = 1000 #number of comments to analyze. Population is 446719

We want the comments themselves, and since we're not selecting or filtering on any other features, we can just pull the comment text column.


In [2]:
#Fetch the comments from the database
sqldb = sql.connect(comment_db)

with sqldb:
    cursor = sqldb.cursor()
    comments = cursor.execute('SELECT comment_text FROM comments')
    rows = comments.fetchall()

data = pd.DataFrame({'comment': rows}) #default integer index is just the row number
data.head()


Out[2]:
comment
0 (7521074355.txt Reclassify The Internet As A C...
1 (7521074318.txt Reclassify The Internet As A C...
2 (7521074516.txt Reclassify The Internet As A C...
3 (7521074376.txt Reclassify The Internet As A C...
4 (7521074413.txt Reclassify The Internet As A C...

You can see that the text itself could use a little cleaning up. Each entry is currently a one-element tuple, so we want to pull out the string, strip the leading file name, and remove the page numbers.

Since the text file name comes at the beginning of each comment, we can strip it out with a regular expression substitution:


In [6]:
#Test our parse on one comment first
test_comment = data.loc[0, 'comment'][0] #take the first element, since each entry is a tuple
test_comment += ' Page 1111' #append a fake multi-digit page marker to test against
re.sub(r'\d+\.txt', '', test_comment)


Out[6]:
' Reclassify The Internet As A Common Carrier. Page 1 Page 1111'

Now let's remove all instances of 'Page #':


In [7]:
re.sub(r'Page \d+', '', test_comment).strip()


Out[7]:
'7521074355.txt Reclassify The Internet As A Common Carrier.'

Now let's wrap those two substitutions up into a function so that we can run all of our comments through it before submission to Algorithmia.


In [8]:
def clean(d):
    '''Takes a tuple containing a string; returns the string with the file name and page numbers removed.'''
    text = d[0] #the comment text is the first element of the tuple
    text = re.sub(r'\d+\.txt', '', text) #strip the file name
    text = re.sub(r'Page \d+', '', text) #strip page markers
    # one liner: re.sub(r'Page \d+', '', d[0].split('.txt ')[1]).strip()
    return text.strip()
clean(data.loc[1, 'comment']) #test


Out[8]:
'Reclassify The Internet As A Common Carrier.'

Now let's run the entire set of comments through the cleaning function.


In [9]:
data['comment'] = data['comment'].apply(clean)
data.loc[1,:] #test


Out[9]:
comment    Reclassify The Internet As A Common Carrier.
Name: 1, dtype: object

Now that we have clean data, let's submit it to Algorithmia. There is one more thing to check first, though: how the data will serialize to JSON. To check, we just encode a few comments.


In [11]:
json.dumps([list(data.loc[1:3,'comment']), num_topics])


Out[11]:
'[["Reclassify The Internet As A Common Carrier.", "Reclassify The Internet As A Common Carrier. Please do the right thing by all Americans!", "Reclassify The Internet As A Common Carrier."], 3]'

Now for the real submission. Since we have so many comments (446,719), let's start by randomly sampling 1,000 of them and see how long that takes.


In [28]:
#Actual data submission
pop_size = len(data.index) #446719
sample = random.randint(0, pop_size, sample_size) #random row indices (drawn with replacement)
submission = json.dumps([list(data.loc[sample, 'comment']), num_topics]) #payload is [documents, num_topics]

In [37]:
#Set up and send the API call
request = Request(api_base + 'kenny/LDA')
request.add_header('Content-Type', 'application/json')
request.add_header('Authorization', api_key)
response = urlopen(request, submission.encode()) #POST the JSON payload
result = json.loads(response.read().decode())

In [41]:
result


Out[41]:
{'result': [{'data': 248,
   'net': 239,
   'isps': 483,
   'services': 407,
   'important': 390,
   'equally': 238,
   'internet': 1044,
   'service': 347},
  {'open': 155,
   'net': 204,
   'providers': 182,
   'neutrality': 185,
   'people': 162,
   'access': 167,
   'internet': 679,
   'service': 211},
  {'neutrality': 1356,
   'net': 1345,
   'title': 798,
   'business': 539,
   'isps': 1585,
   'slow': 542,
   'choice': 794,
   'internet': 1858}],
 'time': 25.83614806}
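
Each topic comes back as a plain dictionary of word counts, so a quick way to read the topics is to sort each dictionary by count. A minimal sketch (unexecuted; it just assumes the result object from above):


In [ ]:
#Print the top five words in each topic, sorted by count
for i, topic in enumerate(result['result']):
    top_words = sorted(topic.items(), key=lambda wc: wc[1], reverse=True)
    print('Topic {}: {}'.format(i, ', '.join(word for word, count in top_words[:5])))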

Next up, put those into a topic cluster graph...
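
In the meantime, here is a rough sketch of one way to chart them: a per-topic bar chart of word counts, rather than a true cluster graph. It assumes matplotlib is installed and the result object from above is still in scope (unexecuted):


In [ ]:
import matplotlib.pyplot as plt

#One subplot per topic, sharing the y-axis so counts are comparable
fig, axes = plt.subplots(1, num_topics, figsize=(12, 4), sharey=True)
for ax, topic in zip(axes, result['result']):
    #Sort each topic's words by count, largest first
    words, counts = zip(*sorted(topic.items(), key=lambda wc: wc[1], reverse=True))
    ax.bar(range(len(words)), counts)
    ax.set_xticks(range(len(words)))
    ax.set_xticklabels(words, rotation=45, ha='right')
plt.tight_layout()
plt.show()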